Generative Pre-training for Speech with Flow Matching
Generative models have gained increasing attention in recent years for
their remarkable success in tasks that require estimating and sampling a
data distribution to generate high-fidelity synthetic data. In speech,
text-to-speech synthesis and neural vocoders are good examples where
generative models have shone. While generative models have been applied to
different applications in speech, there exists no general-purpose
generative model that models speech directly. In this work, we take a step
in this direction by showing that a single pre-trained generative model can
be adapted to different downstream tasks with strong performance.
Specifically, we pre-train a generative model, named SpeechFlow, on 60k
hours of untranscribed speech with Flow Matching and masked conditions.
Experimental results show that the pre-trained generative model can be
fine-tuned with task-specific data to match or surpass existing expert
models on speech enhancement, separation, and synthesis. Our work suggests
that a foundational model for speech generation tasks can be built with
generative pre-training.
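
To make the training recipe above concrete, here is a minimal sketch of one
flow-matching step with masked conditioning in PyTorch. The model
signature, the masking ratio, and the simple linear noise-to-data path are
illustrative assumptions, not the authors' implementation.

    import torch

    def flow_matching_step(model, x1, mask_ratio=0.7):
        """One conditional flow-matching training step (sketch).

        x1: clean speech features, shape (B, T, D). The model learns a
        vector field v(x_t, t, cond) whose ODE transports noise x0 to
        data x1.
        """
        x0 = torch.randn_like(x1)                    # noise sample
        t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # time in [0, 1]
        xt = (1 - t) * x0 + t * x1                   # linear probability path
        target = x1 - x0                             # constant-velocity target

        # Masked condition: the model sees the clean signal only at
        # unmasked positions, as in the self-supervised pre-training.
        keep = torch.rand(x1.shape[:2], device=x1.device) > mask_ratio
        cond = x1 * keep.unsqueeze(-1)

        pred = model(xt, t, cond)                    # hypothetical signature
        return ((pred - target) ** 2).mean()         # flow-matching loss

One plausible reading of the abstract (not a confirmed detail) is that
fine-tuning then swaps the masked clean signal for a task-specific
condition, e.g. a noisy mixture for enhancement, keeping the same loss.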
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Large-scale generative models such as GPT and DALL-E have revolutionized
natural language processing and computer vision research. These models not
only generate high-fidelity text or image outputs, but are also generalists
that can solve tasks they were not explicitly taught. In contrast, speech
generative models are still primitive in terms of scale and task
generalization. In this paper, we present Voicebox, the most versatile
text-guided generative model for speech at scale. Voicebox is a
non-autoregressive flow-matching model trained to infill speech given audio
context and text, and is trained on over 50K hours of speech that is
neither filtered nor enhanced. Similar to GPT, Voicebox can perform many
different tasks through in-context learning, but it is more flexible
because it can also condition on future context. Voicebox can be used for
monolingual or cross-lingual zero-shot text-to-speech synthesis, noise
removal, content editing, style conversion, and diverse sample generation.
In particular, Voicebox outperforms the state-of-the-art zero-shot TTS
model VALL-E on both intelligibility (word error rate: 5.9% for VALL-E vs
1.9% for Voicebox) and audio similarity (0.580 for VALL-E vs 0.681 for
Voicebox) while being up to 20 times faster. See voicebox.metademolab.com
for a demo of the model.
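
The task generality comes from how the infilling interface is framed at
inference: the same model covers TTS, editing, and denoising depending on
where the mask sits. A sketch under assumed names and shapes, not the
released API:

    import torch

    def make_infill_inputs(audio_ctx, text_tokens, gen_len):
        """Frame zero-shot TTS as speech infilling (sketch).

        audio_ctx:   reference speech features, shape (T_ctx, D)
        text_tokens: transcript of the context plus the target text
        gen_len:     number of frames to synthesize
        """
        masked = torch.zeros(gen_len, audio_ctx.size(-1))  # span to generate
        audio = torch.cat([audio_ctx, masked], dim=0)      # context then mask
        mask = torch.cat([torch.zeros(audio_ctx.size(0), dtype=torch.bool),
                          torch.ones(gen_len, dtype=torch.bool)])
        return audio, mask, text_tokens

Masking a middle span instead yields content editing; because the model is
non-autoregressive, it conditions on audio context on both sides of the
mask, and generation solves the learned flow-matching ODE over the masked
frames only.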
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current
speech technology is restricted to about one hundred languages, a small
fraction of the more than 7,000 languages spoken around the world. The
Massively Multilingual Speech (MMS) project increases the number of
supported languages by 10-40x, depending on the task. The main ingredients
are a new dataset based on readings of publicly available religious texts
and the effective use of self-supervised learning. We built pre-trained
wav2vec 2.0 models covering 1,406 languages, a single multilingual
automatic speech recognition model for 1,107 languages, speech synthesis
models for the same number of languages, as well as a language
identification model for 4,017 languages. Experiments show that our
multilingual speech recognition model more than halves the word error rate
of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
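
The released checkpoints can be used directly. Below is a hedged usage
sketch that loads the multilingual ASR model through Hugging Face
Transformers, assuming the public "facebook/mms-1b-all" distribution with
per-language adapters; the checkpoint name and adapter calls follow the
MMS release documentation, not this abstract.

    import numpy as np
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

    # Switch the vocabulary and adapter weights to one of the supported
    # languages via its ISO 639-3 code, e.g. French.
    processor.tokenizer.set_target_lang("fra")
    model.load_adapter("fra")

    audio = np.zeros(16000, dtype=np.float32)  # stand-in for 1 s of 16 kHz speech
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(ids))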
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Transformers achieve remarkable performance on several tasks, but due to
their quadratic complexity with respect to the input's length, they are
prohibitively slow for very long sequences. To address this limitation, we
express self-attention as a linear dot-product of kernel feature maps and
make use of the associativity property of matrix products to reduce the
complexity from O(N^2) to O(N), where N is the sequence length. We show
that this formulation permits an iterative implementation that
dramatically accelerates autoregressive transformers and reveals their
relationship to recurrent neural networks. Our linear transformers achieve
performance similar to vanilla transformers and are up to 4000x faster on
autoregressive prediction of very long sequences.
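
A minimal sketch of the non-causal kernelized attention described above,
using the paper's elu(x) + 1 feature map; associativity lets phi(K)^T V be
computed once, so the cost is linear in sequence length (tensor names and
shapes here are illustrative).

    import torch
    import torch.nn.functional as F

    def linear_attention(Q, K, V, eps=1e-6):
        """Non-causal linear attention (sketch).

        Q, K: (B, N, d), V: (B, N, d_v). With phi(x) = elu(x) + 1,
        attention is phi(Q) (phi(K)^T V) normalized by phi(Q) phi(K)^T 1,
        computed in O(N) instead of O(N^2).
        """
        Q, K = F.elu(Q) + 1, F.elu(K) + 1
        KV = torch.einsum("bnd,bnv->bdv", K, V)   # phi(K)^T V: (B, d, d_v)
        Z = 1.0 / (torch.einsum("bnd,bd->bn", Q, K.sum(dim=1)) + eps)
        return torch.einsum("bnd,bdv,bn->bnv", Q, KV, Z)

The causal, autoregressive variant replaces the global sums with running
prefix sums over positions, carrying KV and the normalizer as recurrent
state; this is exactly the RNN view the abstract refers to.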